Abstract

A  single  Red  wishes  to  shoot  at  a  collection  of  Blue  targets  in  order  to  maximise 
some  measure  of  return  obtained  from  Blues  killed  before  Red’s  own  demise.  While 
the  class  of  decision  processes  called  multi-armed  bandits  has  been  previously  deployed 
to  develop  optimal  policies  for  Red,  we  argue  the  importance  of  a  little  known,  but 
more  general  class  of  bandit  processes  introduced  by  Nash  (1980).  In  particular,  the 
deployment  of  this  class  of  processes  will  enable  Red  to  take  account  in  a  natural  way  of 
the  relative  threats  posed  to  his  own  survival  in  taking  targetting  actions.  We  develop 
optimal  shooting  policies  for  Red  in  the  context  of  a  range  of  models  which  are  of 
independent  interest.  The  paper  concludes  with  a  numerical  study. 


1  Introduction 

A  multi-armed  bandit  problem  arises  when  a  single  key  resource  is  available  for  allocation 
to  a  fixed  collection  of  projects  or  bandits.  These  projects  evolve  stochastically  while  in 
receipt  of  service  (i.e.  while  the  resource  is  allocated  to  them)  and  earn  state  dependent 
returns  as  they  do  so,  but  remain  fixed  (and  earn  nothing)  otherwise.  Gittins  and  Jones 
(1974)  elucidated  the  optimality  of  index  policies  for  certain  classes  of  multi-armed  bandit 
problems.  Such  policies  attach  a  calibrating  index  to  each  project,  a  function  of  that  project’s 
state,  and  choose  at  each  decision  epoch  to  allocate  resource  to  whichever  project  has  the 
largest  associated  index.  See  also  Gittins  (1989).  An  extensive  literature  exists  outlining 
a  range  of  extensions  and  developments  of  Gittins’  classical  work  while  various  schemes  for 
index  computation  have  been  proposed.  See,  for  example,  Whittle  (1980),  Weber  (1992), 
Katehakis and Veinott (1987) and Bertsimas and Niño-Mora (1996).

Recently, Glazebrook and Washburn (2004) have discussed the utilisation of the multi-armed
bandit framework and the associated index policies to develop optimal shooting policies.
Here the “key resource” is a single shooter (Red) and the “projects” form a fixed
collection of targets (Blue). Red’s goal is to so target the Blues as to maximise the expected
number (or value) of kills achieved. Manor and Kress (1997) had previously utilised the theory
of multi-armed bandits to analyse a shooting problem in which Red receives incomplete
information  regarding  the  outcome  of  successive  shots.  If  a  shot  is  unsuccessful  (the  Blue 
target  is  not  killed)  then  Red  receives  no  feedback,  while  if  the  target  is  killed,  that  fact  is 
confirmed  to  Red  with  probability  less  than  one.  Manor  and  Kress  (1997)  demonstrate  the 
optimality  of  a  form  of  index  policy  (the  greedy  shooting  policy)  for  their  setup.  Barkdoll 
et  al.  (2002)  develop  index  policies  for  Red  in  situations  in  which,  not  only  must  he  choose 
which  Blue  to  target,  but  also  how  to  operate  engagement  radar  in  support  of  each  shot. 

In a little known early development of Gittins’ work, Nash (1980) elucidated the optimality
of index policies for a class of generalised bandits in which a form of reward dependence
is induced between the constituent projects or bandits via a multiplicatively separable structure.
Subsequent theoretical developments of Nash’s work include those of Fay and Walrand
(1991) and Crosbie and Glazebrook (2000a,b). A general methodology for the computation
of Nash’s indices may be found in Glazebrook and Greatrix (1995). The prime aim of
the current paper is to argue the importance of this class of generalised bandits and their
associated index policies for the analysis of shooting problems. As we shall see, they are
especially effective in situations in which Red’s engagement with the enemy puts him in
danger and in which the level of danger may differ according to which Blue he targets. In such
situations, Red’s objective becomes the maximisation of the expected number (or value) of
Blues  killed  before  he  himself  is  destroyed.  Now  Red’s  shooting  policy  must  balance  the 
returns  obtainable  from  his  options  against  the  respective  dangers  they  pose.  The  index 
policies  we  develop  elucidate  how  this  balance  should  be  struck. 

Consider,  for  example,  a  military  scenario  discussed  by  Barkdoll  et  al.  (2002)  which  is 
asymmetric  between  the  enemy  forces.  Blue  has  established  air  superiority  in  some  region 
and  Red  is  a  surface-to-air  missile  (SAM)  seeking  to  disrupt  Blue’s  air  campaign.  The  Joint 
Chiefs  of  Staff  use  the  terms  “reactive”  or  “opportune”  suppression  of  enemy  air  defences 
(SEAD).  In  Barkdoll  et  al.  (2002)  every  Red  shot  at  Blue  exposed  him  to  danger  from  a 
stand-off  Blue  shooter.  Moreover,  in  such  a  situation  the  level  of  danger  to  the  Red  SAM 
may  vary  according  to  the  Blue  targets  he  chooses.  For  example,  shooting  at  longer  range 
puts  Red  at  greater  risk  to  anti-radiation  missile  (ARM)  attack  from  Blue  since  the  SAM  will 
need  to  radiate  longer  to  guide  the  missile  to  its  target.  We  introduce  models  and  analyses 
appropriate  for  situations  in  which  Red’s  optimal  shooting  policy  needs  to  take  account  of 
such  risks  to  himself. 

The  paper  is  structured  as  follows.  Section  2  presents  a  general  class  of  shooting  problems 
in  the  form  of  generalised  bandit  problems.  We  also  describe  the  nature  of  the  optimising 
index policies. Each of Sections 3-5 features a particular model along with details of the
corresponding optimal shooting policy. Each of these is of independent interest. Model 1
(in Section 3) is a Bayesian model in which Red is able to learn about the (true) identity
of the Blues he faces as the engagement proceeds. Model 2 (in Section 4) allows for
partial/cumulative damage to each target, while Model 3 (in Section 5) extends Model 1 in
allowing Red to supplement the information he has about the Blues he faces by “looking”
(imperfectly) at the most recently targetted Blue after each shot. In Section 6, we exemplify
the performance of the index policies developed in a numerical study. We conclude in Section
7 with a brief discussion of possible extensions and of issues faced by the Blue force.


2  A  General  Model 

A  shooter  Red  has  to  plan  a  series  of  engagements  with  N  Blues.  A  single  engagement  must 
include  a  shot  by  Red  at  a  targetted  Blue  and  may  expose  Red  to  the  possibility  of  being 
killed  himself.  An  engagement  may  also  incorporate  a  look  by  Red  to  gain  information  on 
the  state  of  the  Blue  targetted  after  he  has  delivered  his  shot.  Red  is  assumed  to  have  an 
infinite  supply  of  bullets.  His  decision  problem  concerns  the  choice  of  which  Blue  to  engage 
next  on  the  basis  of  his  observational  history  of  past  engagements  to  date.  Red’s  goal  is  to 
maximise  his  expected  return  from  engagements  until  he  himself  is  destroyed.  Red’s  decision 
problem is modelled as a Markov decision problem $\{(\Omega_j, a_j, P_j, R_j, Q_j, \beta),\ 1 \le j \le N\}$ as
follows:

(i) $X(t) = \{X_1(t), X_2(t), \ldots, X_N(t)\}$ denotes the state of the system at time $t = 0, 1, 2, \ldots$
(i.e. before the $(t+1)$st engagement) and $X_j(t)$ is the state of Blue $j$. We require that
$X_j(t) \in \Omega_j \cup \{\omega_j\}$, where $\Omega_j$ is the space of possible descriptors of Red’s knowledge of
Blue $j$’s status, while $X_j(t) = \omega_j$ indicates that by time $t$, Red has been killed during
an engagement in which he shot at Blue $j$;

(ii) At each $t = 0, 1, 2, \ldots$, if Red is still alive he must choose one of the actions $a_1, a_2, \ldots, a_N$.
Choice of $a_j$ means that Red’s $(t+1)$st engagement will target Blue $j$;

(iii) If action $a_j$ is chosen at $t$ then Red observes a Markovian change of Blue’s state
$X_j(t) \to X_j(t+1)$. We write

$$P_j(x, y) = P\{X_j(t+1) = y \mid X_j(t) = x, a_j\}, \quad x, y \in \Omega_j \cup \{\omega_j\}.$$

Note that $\Omega_j$ may contain a state $\bar\omega_j$ indicating that Blue is dead and that a still alive
Red knows this. In such cases, both $\bar\omega_j$ and $\omega_j$ are absorbing states under Markovian
law $P_j$. Note that when action $a_j$ is chosen at $t$ then $X_k(t) = X_k(t+1)$, $k \ne j$.

In  order  to  write  down  expressions  for  expected  rewards,  we  shall  suppose  that  an  infinite 
string of members of $\{a_1, a_2, \ldots, a_N\}$ is chosen and the consequential system state changes (as
in (iii)) are observed, but that rewards can only be collected while Red is still alive. To this end,
we introduce bounded functions $R_j$, $Q_j$ and $\bar R_j$, all from $\Omega_j \cup \{\omega_j\}$ to $\mathbb{R}^+$. The quantity
$R_j(x)$ is the expected return secured when action $a_j$ is taken at $t$ and $X_j(t) = x$. Function
$Q_j$ is an indicator such that

$$Q_j(x) = \begin{cases} 1, & x \in \Omega_j, \\ 0, & x = \omega_j, \end{cases}$$

and $\bar R_j$ is the product $R_j Q_j$.

Should action $a_j$ be taken at $t$, the return generated by the ensuing engagement is written

$$\beta^t R_j\{X_j(t)\} \prod_{k=1}^{N} Q_k\{X_k(t)\} = \beta^t \bar R_j\{X_j(t)\} \prod_{k \ne j} Q_k\{X_k(t)\}. \tag{1}$$

The $Q$-multiplicative term in (1) ensures that no rewards are earned beyond any point at
which Red has been killed. The quantity $\beta \in (0, 1)$ is a discount factor and is included
for generality. Provided we make natural model assumptions which guarantee that shooting
stops (with Red dead or all Blues dead) after a finite number of engagements almost surely
then we may also take $\beta = 1$ in what follows and consider undiscounted returns.

A policy is a rule for choosing actions at each $t = 0, 1, 2, \ldots$ in light of the history of
the process to date. Under policy $p$, use $u(t)$ for the choice made at $t$. We write the total
expected return under policy $p$ as

$$E_p\left[\sum_{t=0}^{\infty} \beta^t R_{u(t)}\{X_{u(t)}(t)\} \prod_{k=1}^{N} Q_k\{X_k(t)\}\right]. \tag{2}$$

The goal is to find a policy $p^*$ to maximise the expected return in (2). The above is in the
class of Markov decision models called generalised bandits introduced by Nash (1980). These
models extend the multi-armed bandits of Gittins (1979, 1989) by allowing a reward
interdependence between the decision options, as expressed in the multiplicatively separable form
to be found in (2). The theory of this class of processes has been developed in Glazebrook
(1993), Glazebrook and Greatrix (1995) and Crosbie and Glazebrook (2000a,b). For our
purposes, the key fact is that for the class of problems outlined in (i)-(iv), there exists an
optimal policy of index form. This is expressed in Theorem 1.


Theorem 1 (Nash (1980)) There exist functions $G_j : \Omega_j \to \mathbb{R}^+$ such that, if Red is still
alive at $t$ then he optimally engages any Blue $j^*$ for which

$$G_{j^*}\{X_{j^*}(t)\} = \max_{1 \le j \le N} G_j\{X_j(t)\}. \tag{3}$$


The indices in (3) are broadly of Gittins type. To develop index $G_j(x)$ for some $x \in \Omega_j$,
suppose that at $t = 0$, Blue $j$ is in state $x$ and is engaged continuously by Red. Let $\tau$ be a
positive valued stopping time on the resulting process $\{X_j(t), t \ge 0\}$ which evolves from $x$
according to the Markov law $P_j$. Use $R_j(x, \tau)$ for the expected return earned during $[0, \tau)$
as expressed by

$$R_j(x, \tau) = E\left[\sum_{t=0}^{\tau-1} \beta^t \bar R_j\{X_j(t)\} \,\Big|\, X_j(0) = x\right] \tag{4}$$

and rewards are automatically terminated at Red’s death. Develop a corresponding reward
rate-like measure as

$$G_j(x, \tau) = R_j(x, \tau)\left(1 - E[\beta^{\tau} Q_j\{X_j(\tau)\} \mid X_j(0) = x]\right)^{-1}. \tag{5}$$

The index $G_j(x)$ is the largest such reward rate, namely

$$G_j(x) = \sup_{\tau} G_j(x, \tau). \tag{6}$$


A  general  methodology  for  index  computation  may  be  found  in  Glazebrook  and  Greatrix 
(1995). 

We now present three particular models, each of which illustrates and presents salient
features of combat scenarios. In two cases, the indices which determine optimal engagement
policies for Red are obtained in closed form. In the more complex “shoot-look-shoot” setup
of Model 3, we give an algorithm for index development.


3 Model 1 - Red learns about the nature of Blue targets


While  the  general  scenario  (Red  facing  N  Blues)  is  as  above,  we  shall  particularise  to  Model  1 
in  supposing  that  Blues  come  in  B  types  and  Red  has  imperfect  information  about  the  Blues 
he  is  facing.  Note  that  “type”  designation  here  may  reflect  any  Blue  characteristics  which 
are relevant to determining outcomes as the conflict proceeds. Red’s uncertainty about Blue
is expressed through $N$ independent prior distributions $\Pi^1, \Pi^2, \ldots, \Pi^N$ which summarise his
beliefs before shooting starts. Hence $\Pi^j_b$ is the probability that Red assigns to the event
“Blue number $j$ is of type $b$”, $1 \le j \le N$, $1 \le b \le B$. At each time $t = 0, 1, 2, \ldots$ Red
targets a single Blue and shooting continues until either Red is dead or all the Blues are.
Conditional upon the event that a Blue targetted by Red is actually of type $b$, Red has a
probability $r_b$ of killing Blue while there is a probability $\theta_b$ that he himself is killed during the
engagement. Red always has perfect information about whether each Blue is alive or dead
and hence the model calls for the inclusion of state $\bar\omega_j$ within $\Omega_j$ as mentioned in Section
2(iii) above. All shooting outcomes are assumed independent. Should Red kill a type $b$ Blue
with his shot then he receives a return $R_b$. Red’s goal is to maximise the expected
return from Blues killed prior to his own destruction. The expectation concerned is taken
both with respect to Red’s prior beliefs as well as over realisations of the process. Note that
the choices $\beta = 1$, $R_b = 1$, $1 \le b \le B$, lead to a maximisation of the number of Blues killed
before Red’s death.

A  crucial  feature  of  the  model  concerns  Red’s  capacity  to  update  his  beliefs  about  the 
Blues  he  is  facing  in  the  light  of  past  engagements  by  using  Bayes’  Theorem.  In  particular, 
if  Blue  j  has  been  targetted  in  n  engagements  and  he  and  Red  have  survived  them  all  (note 
that  these  are  the  only  event  types  of  relevance  for  future  decision-making)  then  the  posterior 
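distribution $\Pi^{j,n}$ summarising Red’s updated beliefs about Blue $j$ is given by

$$\Pi^{j,n}_b = \frac{\Pi^j_b (1 - r_b)^n (1 - \theta_b)^n}{\sum_{b'=1}^{B} \Pi^j_{b'} (1 - r_{b'})^n (1 - \theta_{b'})^n}, \quad 1 \le b \le B. \tag{7}$$

For notational simplicity, we shall refer to the denominator in (7) as $D_j(\Pi^j, n)$.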
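The update in (7) is simple to compute. The following Python fragment is a minimal sketch under the reading of (7) in which each inconclusive engagement multiplies the weight of type $b$ by $(1 - r_b)(1 - \theta_b)$; the function name `posterior` and its argument names are illustrative assumptions, not notation from the paper.

```python
from typing import List, Sequence


def posterior(prior: Sequence[float], r: Sequence[float],
              theta: Sequence[float], n: int) -> List[float]:
    """Posterior over the B Blue types after n inconclusive engagements, as in (7).

    prior[b] is Red's prior probability of type b, r[b] is Red's kill
    probability against type b, theta[b] is the probability Red is killed.
    """
    # Unnormalised weight: prior times the chance both sides survive n engagements.
    weights = [p * ((1.0 - rb) * (1.0 - tb)) ** n
               for p, rb, tb in zip(prior, r, theta)]
    total = sum(weights)  # plays the role of the denominator D_j(Pi^j, n)
    if total == 0.0:
        raise ValueError("n inconclusive engagements have probability zero")
    return [w / total for w in weights]
```

With two equally likely types, one of which is easy to kill, a single inconclusive engagement already shifts belief toward the harder type, which is the learning effect the model relies on.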

This problem may be represented within the general formulation of Section 2 (i)-(iv) as
follows:

(i) State space $\Omega_j$ is taken to be $\mathbb{N} \cup \{\bar\omega_j\}$. If $X_j(t) = n \in \mathbb{N}$ then at time $t$, Blue $j$ has
been targetted in $n$ engagements with Red, all of which have been inconclusive (neither
killed).

(iii) Should action $a_j$ be chosen at $t$ when $X_j(t) = n$ then, following the resulting engagement
a transition to $X_j(t+1)$ occurs according to Markovian law $P_j$ where

$$P_j(n, n+1) = P(\text{neither Red nor Blue } j \text{ killed}) = D_j(\Pi^j, n+1)/D_j(\Pi^j, n);$$

$$P_j(n, \bar\omega_j) = P(\text{Blue } j \text{ killed but not Red}) = \sum_{b=1}^{B} \Pi^j_b (1 - r_b)^n (1 - \theta_b)^n r_b (1 - \theta_b) / D_j(\Pi^j, n),$$

and

$$P_j(n, \omega_j) = P(\text{Red killed}) = \sum_{b=1}^{B} \Pi^j_b (1 - r_b)^n (1 - \theta_b)^n \theta_b / D_j(\Pi^j, n).$$
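As a check on the transition law in (iii), the three outcome probabilities should sum to one for every $n$. The sketch below assumes that, for a type $b$ Blue, a single engagement leaves both sides alive with probability $(1 - r_b)(1 - \theta_b)$, kills Blue but not Red with probability $r_b(1 - \theta_b)$ and kills Red with probability $\theta_b$; the function name `transition_probs` is hypothetical.

```python
from typing import Sequence, Tuple


def transition_probs(prior: Sequence[float], r: Sequence[float],
                     theta: Sequence[float], n: int) -> Tuple[float, float, float]:
    """Return (both survive, Blue killed but not Red, Red killed) from state n."""
    # Posterior weights after n inconclusive engagements (unnormalised).
    w = [p * ((1.0 - rb) * (1.0 - tb)) ** n for p, rb, tb in zip(prior, r, theta)]
    d = sum(w)  # D_j(Pi^j, n)
    both_survive = sum(wb * (1.0 - rb) * (1.0 - tb) for wb, rb, tb in zip(w, r, theta)) / d
    blue_killed = sum(wb * rb * (1.0 - tb) for wb, rb, tb in zip(w, r, theta)) / d
    red_killed = sum(wb * tb for wb, tb in zip(w, theta)) / d
    return both_survive, blue_killed, red_killed
```

The first component equals $D_j(\Pi^j, n+1)/D_j(\Pi^j, n)$, since multiplying each weight by $(1 - r_b)(1 - \theta_b)$ advances it from $n$ to $n + 1$.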

The expected return (undiscounted) from the engagement in (iii) above is given by

$$\sum_{b=1}^{B} \Pi^j_b (1 - r_b)^n (1 - \theta_b)^n R_b r_b / D_j(\Pi^j, n).$$

In following the prescription for index computation at the end of Section 2, note that in
taking the supremum in (6), we may restrict to stationary stopping times, i.e. those which
stop the process $\{X_j(t), t \ge 0\}$ upon entry into a fixed stopping set. Hence, we consider the
computation of index $G_j(n)$, appropriate for Blue $j$ in state $n \in \mathbb{N}$. Specify positive integer
$r$ and write $\tau_r$ for Red’s stopping time in which from time 0 (at which point $X_j(0) = n$),
Red has $r$ further engagements which target Blue $j$ unless one or other of them is destroyed
first. The random variable $\tau_r$ is the number of shots from Red which results from this, and
cannot exceed $r$ or be less than one. The expected return $R_j(n, \tau_r)$ obtained by Red from
these engagements is given by

$$R_j(n, \tau_r) = \sum_{b=1}^{B} \Pi^j_b (1 - r_b)^n (1 - \theta_b)^n \left\{ R_b r_b \sum_{s=0}^{r-1} \beta^s (1 - r_b)^s (1 - \theta_b)^s \right\} / D_j(\Pi^j, n), \tag{8}$$

while we also have

$$E[\beta^{\tau_r} Q_j\{X_j(\tau_r)\} \mid X_j(0) = n] = \sum_{b=1}^{B} \Pi^j_b (1 - r_b)^n (1 - \theta_b)^n \times \left\{ \sum_{s=0}^{r-1} \beta^{s+1} r_b (1 - r_b)^s (1 - \theta_b)^{s+1} + \beta^r (1 - r_b)^r (1 - \theta_b)^r \right\} / D_j(\Pi^j, n). \tag{9}$$

From (5), (6), (8) and (9) and Theorem 1 we deduce the following:


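Theorem 2 If Red is still alive at $t$ then he optimally targets any Blue $j^*$ for which $X_{j^*}(t) \ne \bar\omega_{j^*}$
and such that

$$G_{j^*}\{X_{j^*}(t)\} = \max_{j} G_j\{X_j(t)\},$$

where the maximisation is over those $j$ for which $X_j(t) \ne \bar\omega_j$ and where

$$G_j(n) = \max_{r \ge 1} \left[ \frac{\sum_{b=1}^{B} \Pi^j_b (1 - r_b)^n (1 - \theta_b)^n R_b r_b \left\{ \sum_{s=0}^{r-1} \beta^s (1 - r_b)^s (1 - \theta_b)^s \right\}}{\sum_{b=1}^{B} \Pi^j_b (1 - r_b)^n (1 - \theta_b)^n \{1 - F_{1b}(r) - F_{2b}(r)\}} \right], \quad n \in \mathbb{N},\ 1 \le j \le N, \tag{10}$$

where

$$F_{1b}(r) = \sum_{s=0}^{r-1} \beta^{s+1} r_b (1 - r_b)^s (1 - \theta_b)^{s+1}, \quad r \ge 1,\ 1 \le b \le B,$$

and

$$F_{2b}(r) = \beta^r (1 - r_b)^r (1 - \theta_b)^r, \quad r \ge 1,\ 1 \le b \le B.$$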
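The index in (10) can be evaluated numerically by scanning over $r$. The sketch below is an illustration rather than the authors' procedure: it assumes the reading of $F_{1b}$ and $F_{2b}$ implied by (9), requires $\beta < 1$, and truncates the maximisation at a finite `r_max`, which is an assumption introduced here since the maximum may only be approached as $r \to \infty$. The names `ratio_for_r` and `model1_index` are hypothetical.

```python
from typing import Sequence


def ratio_for_r(prior: Sequence[float], r: Sequence[float], theta: Sequence[float],
                R: Sequence[float], beta: float, n: int, horizon: int) -> float:
    """The ratio inside the maximum of (10) for a fixed number of engagements."""
    num = 0.0
    den = 0.0
    for p, rb, tb, Rb in zip(prior, r, theta, R):
        w = p * ((1.0 - rb) * (1.0 - tb)) ** n   # posterior weight (unnormalised)
        c = beta * (1.0 - rb) * (1.0 - tb)       # discounted "both survive" factor
        geom = sum(c ** s for s in range(horizon))
        f1 = beta * rb * (1.0 - tb) * geom       # F_1b(r)
        f2 = c ** horizon                        # F_2b(r)
        num += w * Rb * rb * geom
        den += w * (1.0 - f1 - f2)
    return num / den


def model1_index(prior, r, theta, R, beta, n, r_max=200):
    """Approximate G_j(n) by maximising over r = 1, ..., r_max (truncation assumed)."""
    return max(ratio_for_r(prior, r, theta, R, beta, n, h) for h in range(1, r_max + 1))
```

When there is a single Blue type the ratio does not depend on $r$ and reduces to $R_b r_b / (1 - \beta + \beta\theta_b)$, which gives a convenient sanity check.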

In order to understand index structure, introduce the “one-step index” $H_j(n)$ obtained
by taking $r = 1$ in (10) as

$$H_j(n) = \frac{\sum_{b=1}^{B} \Pi^j_b (1 - r_b)^n (1 - \theta_b)^n R_b r_b}{\sum_{b=1}^{B} \Pi^j_b (1 - r_b)^n (1 - \theta_b)^n (1 - \beta + \beta\theta_b)}. \tag{11}$$

It is straightforward to establish the following:

(a) If $H_j(n)$ is decreasing in $n$ then the maximum in (10) is attained at $r = 1$ for all $n$
and it then follows that $G_j(n) = H_j(n)$, $n \in \mathbb{N}$. If this behaviour holds good for all
Blues then Red’s optimal shooting policy is quasi-myopic (a one-step look ahead rule).
Here indices decrease through to Blue’s destruction and consequently the optimal index
policy will tend to involve Red making frequent changes to the Blue targetted;

(b) If $H_j(n)$ is increasing in $n$, then the maximum in (10) is attained for all $n$ in the limit
as $r \to \infty$. When this happens the index $G_j(n)$ will take the form

$$\frac{\sum_{b=1}^{B} \Pi^j_b (1 - r_b)^n (1 - \theta_b)^n R_b r_b [1 - \beta(1 - r_b)(1 - \theta_b)]^{-1}}{\sum_{b=1}^{B} \Pi^j_b (1 - r_b)^n (1 - \theta_b)^n \{(1 - \beta + \beta\theta_b)[1 - \beta(1 - r_b)(1 - \theta_b)]^{-1}\}}$$

and will be increasing in $n$. If this behaviour holds good for all Blues then Red will, in
an optimal policy, persist in targeting individual Blues in turn until each is destroyed.

(c) If there are just two Blue types ($B = 2$), then it can be shown that one of the cases
described in (a) and (b) must hold for each Blue target.


Lemma 3 If $B = 2$ then either each $H_j(n)$ is increasing in $n$ or each $H_j(n)$ is decreasing
in $n$.

Proof It is straightforward to show algebraically that, for any $j$ and $n \in \mathbb{N}$,

$$H_j(n) \ge H_j(n+1)$$

if and only if

$$[\{1 - \beta + \beta\theta_1\} R_2 r_2 - \{1 - \beta + \beta\theta_2\} R_1 r_1](1 - r_1)(1 - \theta_1)
\ge [\{1 - \beta + \beta\theta_1\} R_2 r_2 - \{1 - \beta + \beta\theta_2\} R_1 r_1](1 - r_2)(1 - \theta_2).$$

This condition depends upon neither $j$ nor $n$. The result follows. □
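The condition in the proof of Lemma 3 can be checked numerically against the one-step index itself. The following sketch assumes the form of $H_j(n)$ obtained by setting $r = 1$ in (10); the function names and the parameter values used in the check are illustrative assumptions only.

```python
from typing import Sequence


def one_step_index(prior: Sequence[float], r: Sequence[float], theta: Sequence[float],
                   R: Sequence[float], beta: float, n: int) -> float:
    """One-step index H_j(n): posterior-weighted return over posterior-weighted exposure."""
    w = [p * ((1.0 - rb) * (1.0 - tb)) ** n for p, rb, tb in zip(prior, r, theta)]
    num = sum(wb * Rb * rb for wb, Rb, rb in zip(w, R, r))
    den = sum(wb * (1.0 - beta + beta * tb) for wb, tb in zip(w, theta))
    return num / den


def lemma3_says_decreasing(r, theta, R, beta) -> bool:
    """Evaluate the two-type condition of Lemma 3 (independent of j and n)."""
    a = ((1.0 - beta + beta * theta[0]) * R[1] * r[1]
         - (1.0 - beta + beta * theta[1]) * R[0] * r[0])
    return a * (1.0 - r[0]) * (1.0 - theta[0]) >= a * (1.0 - r[1]) * (1.0 - theta[1])
```

In the check that accompanies this sketch, type 1 has the larger return/exposure ratio but is also the type more likely to have been killed already, so inconclusive engagements shift belief toward type 2 and the index falls with $n$.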


Comments 

The one-step index $H_j(n)$ in (11) may be thought of (somewhat crudely) as a weighted
average (with respect to the posterior distribution) of a return/exposure index

$$R_b r_b \{1 - \beta + \beta\theta_b\}^{-1}$$

for Blues of type $b$. This index is high when $R_b$ and $r_b$ are large and when $\theta_b$ is small.
It is plainly such Blue types which Red should target early. Note the dependence of this
quantity on $\theta_b$. Plainly, Red should avoid targetting Blues with large associated $\theta$-values as
such engagements are high risk for him and his early demise will preempt the possibility of
accumulating further returns.

The  intuition  behind  Lemma  3  is  that  when  B  =  2,  then  for  a  specific  Blue,  as  the 
number  of  inconclusive  engagements  increases,  the  balance  of  Red’s  beliefs  about  that  Blue 
will  move  systematically  either  from  it  being  of  type  1  toward  it  being  type  2  or  in  the 
opposite  direction.  One  of  these  directions  will  yield  an  increasing  index  and  one  a  decreasing 
index,  depending  on  whether  type  1  or  type  2  has  the  larger  return/exposure  index. 

4  Model  2  -  Red  inflicts  accumulating  damage  upon 
Blue 

The  model  discussed  here  is  rather  different  in  character  to  those  of  Sections  3  and  5.  While 
there  is  now  no  Bayesian  learning  for  Red,  we  do  allow  the  N  Blues  targetted  by  Red  to 
suffer  accumulating  damage  during  successive  engagements.  This  is  a  step  in  the  direction  of 
shooting  problems  with  targets  whose  characteristics  evolve  dynamically.  See  the  comments 
in  Section  7(b).  We  shall  here  make  the  simplifying  assumption  that  an  engagement  consists 
of a shot by Red at Blue $j$, say, followed by a retaliatory strike from the Blue targetted.
Further, a severely damaged Blue will be less lethal to Red. Should a Blue’s damage be
sufficient, it is deemed to have been killed. To express this, we assume that each Blue can be
in any one of $K$ states, labelled $\{1, 2, \ldots, K\}$ and that this state is observable without error
by Red. As state $k$ runs from 1 to $K$ it represents increasing degrees of damage with $K = \bar\omega_j$
corresponding to Blue’s death. The Markovian law $P^j$ determines how Blue $j$ evolves to
higher damage states under successive attacks from Red, while $\theta_j(k)$ is the probability that
Blue $j$ can kill Red with a shot when in damage state $k$, where $P^j_{kl} = 0$, $l < k$, and $\theta_j(K) = 0$.

The  general  formulation  of  Section  2  (i)-(iv)  should  be  adapted  to  this  case  as  follows: 

(i) State space $\Omega_j$ is $\{1, 2, \ldots, K\}$ with $K = \bar\omega_j$.

(iii) Should action $a_j$ be chosen at $t$ when $X_j(t) = k \in \{1, 2, \ldots, K-1\}$ then, following
the resulting engagement between Red and Blue $j$ a transition to $X_j(t+1)$ occurs
according to Markovian law $P_j$ where

$$P_j(k, l) = P(\text{engagement inconclusive, with Blue’s damage } k \to l) = P^j_{kl}\{1 - \theta_j(l)\}, \quad k \le l \le K-1;$$

$$P_j(k, \bar\omega_j) = P(\text{Blue killed but not Red}) = P^j_{kK};$$

and

$$P_j(k, \omega_j) = P(\text{Red killed}) = \sum_{l=k}^{K-1} P^j_{kl}\theta_j(l).$$

(iv) The expected return from the engagement in (iii) above is given by

$$R_j(k) = \beta R_j P^j_{kK}, \quad k \in \{1, 2, \ldots, K-1\},$$

where we assume that the reward $R_j$ is received when Blue $j$ enters state $K$.


We consider the computation of index $G_j(k)$, appropriate for calibrating Blue $j$ when in state
$k \in \{1, 2, \ldots, K-1\}$. To this end, suppose that $X_j(0) = k$ and that Blue $j$ is subjected to
successive engagements with Red. Any stationary positive-valued stopping time $\tau$ on Blue’s
evolving state corresponds to a choice of subset $S(k) \subseteq \{k, k+1, \ldots, K-1\}$ such that

$$\tau = \min[t;\ t > 0 \text{ and } X_j(t) \in S(k) \cup \{\bar\omega_j\} \cup \{\omega_j\}]. \tag{12}$$

The expected return $R_j(k, \tau)$ obtained by Red in all engagements with Blue $j$ up to stopping
time $\tau$ is given by

$$R_j(k, \tau) = R_j Z^j_k\{S(k)\},$$

where the quantities $\{Z^j_l\{S(k)\},\ 1 \le l \le K-1\}$ satisfy recursions

$$Z^j_l\{S(k)\} = \beta P^j_{lK} + \beta \sum_{l \le m \le K-1,\ m \notin S(k)} P^j_{lm}\{1 - \theta_j(m)\} Z^j_m\{S(k)\}, \quad 1 \le l \le K-1.$$

The corresponding reward rate from (5) is given by

$$G_j(k, \tau) = R_j Z^j_k\{S(k)\}\left(1 - E[\beta^{\tau} Q_j\{X_j(\tau)\} \mid X_j(0) = k]\right)^{-1}.$$

In Theorem 4, we use $2^{\{k, k+1, \ldots, K-1\}}$ for the collection of subsets of $\{k, k+1, \ldots, K-1\}$.


Theorem 4 If Red is still alive at $t$ then he optimally targets any Blue $j^*$ for which $X_{j^*}(t) \ne \bar\omega_{j^*}$
and such that

$$G_{j^*}\{X_{j^*}(t)\} = \max_{j} G_j\{X_j(t)\},$$

where the maximisation is over those $j$ for which $X_j(t) \ne \bar\omega_j$ and where

$$G_j(k) = \max_{S(k)} \left( R_j Z^j_k\{S(k)\}\left(1 - E[\beta^{\tau} Q_j\{X_j(\tau)\} \mid X_j(0) = k]\right)^{-1} \right), \quad k \in \{1, 2, \ldots, K-1\},\ 1 \le j \le N, \tag{13}$$

where $\tau$ is the stopping time associated with $S(k)$ as in (12) and the maximisation in (13) is over $2^{\{k, k+1, \ldots, K-1\}}$.

While there are efficient algorithms for computing the indices in (13) (including the adaptive
greedy algorithm of Robinson (1982) or the “restart-in-$k$” construction of Katehakis and
Veinott (1987)) we now introduce plausible assumptions regarding the system’s stochastic
structure which greatly simplify index structure.

Assumptions 

(1) For all $j$, $\sum_{l=m}^{K} P^j_{kl}$ is increasing in $k$ for each choice of $m \in \{k, k+1, \ldots, K\}$;

(2) For all $j$, $\theta_j(k)$ is decreasing in $k$.


Assumption (1) states that in the engagement discussed in (iii) above, Blue’s new damage
state $X_j(t+1)$ is stochastically increasing in its old damage state, $X_j(t)$. Assumption (2)
states that Blue $j$ becomes less lethal to Red as it is increasingly damaged.

We proceed to develop index $G_j(k)$ by introducing quantities

$$Z^j_l = \beta P^j_{lK} + \beta \sum_{m=l}^{K-1} P^j_{lm}\{1 - \theta_j(m)\} Z^j_m = \beta \sum_{m=l}^{K} P^j_{lm}\{1 - \theta_j(m)\} Z^j_m, \quad \text{where } Z^j_K = 1. \tag{14}$$


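Lemma 5 The quantity $Z^j_k$ is increasing in $k$, for each $j$, $1 \le j \le N$.

Proof It is plain that $Z^j_{K-1} \le Z^j_K = 1$. We shall proceed by induction, will suppose that
$Z^j_{k+1} \le Z^j_{k+2} \le \ldots \le Z^j_K$ and will show that the inequality $Z^j_k \le Z^j_{k+1}$ follows.

First observe that, from Assumption (2) and the inductive hypothesis, it follows that

$$\{1 - \theta_j(k+1)\} Z^j_{k+1} \le \{1 - \theta_j(k+2)\} Z^j_{k+2} \le \ldots \le \{1 - \theta_j(K)\} Z^j_K = 1.$$

Now, utilising Assumption (1) we have that

$$\begin{aligned}
Z^j_{k+1} &= \beta \sum_{l=k+1}^{K} P^j_{k+1,l}\{1 - \theta_j(l)\} Z^j_l \\
&= \beta\{1 - \theta_j(k+1)\} Z^j_{k+1} + \beta \sum_{l=k+2}^{K} \left[\{1 - \theta_j(l)\} Z^j_l - \{1 - \theta_j(l-1)\} Z^j_{l-1}\right] \left(\sum_{m=l}^{K} P^j_{k+1,m}\right) \\
&\ge \beta\{1 - \theta_j(k+1)\} Z^j_{k+1} + \beta \sum_{l=k+2}^{K} \left[\{1 - \theta_j(l)\} Z^j_l - \{1 - \theta_j(l-1)\} Z^j_{l-1}\right] \left(\sum_{m=l}^{K} P^j_{km}\right).
\end{aligned} \tag{15}$$

Similarly, we have that

$$\begin{aligned}
Z^j_k &= \beta \sum_{l=k}^{K} P^j_{kl}\{1 - \theta_j(l)\} Z^j_l \\
&= \beta P^j_{kk}\{1 - \theta_j(k)\} Z^j_k + \beta\{1 - \theta_j(k+1)\} Z^j_{k+1} \left(\sum_{m=k+1}^{K} P^j_{km}\right) \\
&\quad + \beta \sum_{l=k+2}^{K} \left[\{1 - \theta_j(l)\} Z^j_l - \{1 - \theta_j(l-1)\} Z^j_{l-1}\right] \left(\sum_{m=l}^{K} P^j_{km}\right).
\end{aligned} \tag{16}$$

From (15) and (16) we infer that

$$Z^j_{k+1}[1 - \beta\{1 - \theta_j(k+1)\}] \ge Z^j_k[1 - \beta P^j_{kk}\{1 - \theta_j(k)\}] - Z^j_{k+1}\beta\{1 - \theta_j(k+1)\}(1 - P^j_{kk})$$

and hence that

$$Z^j_{k+1}[1 - \beta P^j_{kk}\{1 - \theta_j(k+1)\}] \ge Z^j_k[1 - \beta P^j_{kk}\{1 - \theta_j(k)\}]. \tag{17}$$

We now use $\theta_j(k) \ge \theta_j(k+1)$ together with (17) to infer that $Z^j_k \le Z^j_{k+1}$. The induction
goes through and the proof is concluded. □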

Theorem 6 Under Assumptions (1, 2), Blue index $G_j(k)$ is given by

$$G_j(k) = R_j Z^j_k (1 - Z^j_k)^{-1}, \quad k \in \{1, 2, \ldots, K-1\},\ 1 \le j \le N,$$

and is increasing in $k$.
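The quantities in (14) can be computed by a single backward pass over the damage states, after which Theorem 6 gives the index directly. The sketch below is illustrative and makes several assumptions: states are stored 0-based (so list position `k` corresponds to damage state $k + 1$), the matrix `P` is upper triangular with rows summing to one, `theta[K-1]` is zero, and $\beta < 1$ so that no division by zero arises. The function name `damage_index` is hypothetical.

```python
from typing import List, Sequence, Tuple


def damage_index(P: Sequence[Sequence[float]], theta: Sequence[float],
                 R: float, beta: float) -> Tuple[List[float], List[float]]:
    """Return (indices G(1..K-1), quantities Z(1..K)) for one Blue under Theorem 6."""
    K = len(P)
    Z = [0.0] * K
    Z[K - 1] = 1.0  # Z_K = 1: Blue is already dead
    for l in range(K - 2, -1, -1):
        # Terms with m > l are already known; the m = l term is moved to the left side.
        rhs = beta * sum(P[l][m] * (1.0 - theta[m]) * Z[m] for m in range(l + 1, K))
        Z[l] = rhs / (1.0 - beta * P[l][l] * (1.0 - theta[l]))
    G = [R * Z[k] / (1.0 - Z[k]) for k in range(K - 1)]
    return G, Z
```

When every live state is certain to kill Red, the pass returns $\beta P^j_{kK}$ for each live state $k$, so any shot is a gamble on an immediate kill. When the lethalities decrease with damage and the tail sums of `P` increase with $k$, the computed values increase with $k$, as Lemma 5 asserts.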


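Proof Suppose that $X_j(0) = k \in \{1, 2, \ldots, K-1\}$ and that stopping time $\tau$ has associated
stopping set $S(k)$ as in (12), for which $P\{X_j(\tau) \in S(k)\} > 0$. Use $\bar\tau$ for the stopping time
corresponding to $S(k) = \emptyset$. Plainly $\tau \le \bar\tau$ with probability one. Utilising the above defined
quantities we have that

$$G_j(k, \bar\tau) = R_j Z^j_k (1 - Z^j_k)^{-1} = \frac{R_j Z^j_k\{S(k)\} + R_j E\left(\beta^{\tau} I\{X_j(\tau) \in S(k)\} Z^j_{X_j(\tau)}\right)}{1 - E[\beta^{\tau} Q_j\{X_j(\tau)\}] + E\left(\beta^{\tau} I\{X_j(\tau) \in S(k)\}[1 - Z^j_{X_j(\tau)}]\right)}. \tag{18}$$

But from Lemma 5 we have that

$$X_j(\tau) \in S(k) \Rightarrow R_j Z^j_{X_j(\tau)} (1 - Z^j_{X_j(\tau)})^{-1} \ge R_j Z^j_k (1 - Z^j_k)^{-1} = G_j(k, \bar\tau). \tag{19}$$

From (18) and (19) it follows that

$$G_j(k, \bar\tau) \ge \frac{R_j Z^j_k\{S(k)\} + G_j(k, \bar\tau) E\left(\beta^{\tau} I\{X_j(\tau) \in S(k)\}[1 - Z^j_{X_j(\tau)}]\right)}{1 - E[\beta^{\tau} Q_j\{X_j(\tau)\}] + E\left(\beta^{\tau} I\{X_j(\tau) \in S(k)\}[1 - Z^j_{X_j(\tau)}]\right)}. \tag{20}$$

It now follows immediately from (20) that

$$G_j(k, \bar\tau) \ge R_j Z^j_k\{S(k)\}\left(1 - E[\beta^{\tau} Q_j\{X_j(\tau)\}]\right)^{-1} = G_j(k, \tau) \tag{21}$$

for any $\tau$ and associated stopping set $S(k)$. The result immediately follows from (21) and
the form of the index $G_j(k)$ given in (13). The increasing nature of $G_j(k)$ follows from the
increasing nature of $Z^j_k$ reported in Lemma 5. □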


Comments 

(a)  Under  Assumptions  (1,2),  the  increasing  nature  of  index  Gj{k)  in  k  means  that  in  an 
optimal  policy  Red  will  engage  each  Blue  continually  until  the  latter  is  killed  (unless 
Red  dies  first).  This  approach  is  intuitive  since  Blue’s  accumulating  damage  through 
his  engagements  not  only  brings  his  own  death  closer  (Assumption  (1)),  but  also  makes 
him  progressively  less  lethal  to  Red  (Assumption  (2)).  Hence  it  is  clear  that  Red  should 
continue  shooting  at  a  partly  damaged  Blue  and  the  index  policy  guarantees  that  this 
is  so. 

(b) To see how the index $G_j(k)$ depends upon Blue $j$’s lethality, consider two extreme
cases. Suppose first that Blue $j$ is lethal right up to its own destruction, namely

$$\theta_j(l) = 1, \quad 1 \le l \le K-1.$$

It then follows that

$$Z^j_k = \beta P^j_{kK}$$

and hence that

$$G_j(k) = R_j \beta P^j_{kK} (1 - \beta P^j_{kK})^{-1}. \tag{22}$$

Any shot by Red at such a Blue is a gamble that the latter will be killed with a single
shot. Suppose now that Blue $j$ poses very little retaliatory threat to Red in that

$$\theta_j(l) = 0, \quad 1 \le l \le K-1.$$

Consider the quantities $\{Z^j_l,\ 1 \le l \le K-1\}$ satisfying the recursions

$$Z^j_l = \beta P^j_{lK} + \beta \sum_{m=l}^{K-1} P^j_{lm} Z^j_m, \quad 1 \le l \le K-1, \quad Z^j_K = 1.$$

We now have

$$G_j(k) = R_j Z^j_k (1 - Z^j_k)^{-1} \tag{23}$$

and Red’s only concern now is the speed with which Blue $j$ can be killed and the return
$R_j$ claimed. Not surprisingly, the index in (22) will be smaller than that in (23).


5  Model  3  -  ‘Shoot-look-shoot’  for  Red 

Our  goal  here  is  to  give  the  reader  insight  concerning  the  generality  of  our  modelling/solution 
approach  by  introducing  developments  of  Model  1  of  considerable  practical  import.  The 
general scenario and $\Pi^j$, $R_b$, $r_b$ and $\beta$ are all as before. However, now we shall suppose
that after every shot by Red, the targetted Blue is inspected and categorised (with error)
according to Blue target type and alive/dead. Write $\delta \in \{1, 2, \ldots, B\} \times \{\text{alive}, \text{dead}\}$ for a
generic classification. We have that

$$P[\text{Blue judged to be } \delta \mid \text{Blue is alive of type } b] = \phi_{\delta b},$$
$$P[\text{Blue judged to be } \delta \mid \text{Blue is dead of type } b] = \bar\phi_{\delta b},$$

where $1 \le b \le B$. We shall also suppose that Red’s vulnerability depends upon whether the
targetted Blue is alive or dead. We use $\theta_b$ for the probability that Red is killed during an
engagement in which he targets a Blue of type $b$ who is still alive. This becomes $\bar\theta_b$ if the
targetted Blue is dead.

Red now gathers information about the Blues he is facing through the series of engagements in a more complicated way than for Model 1. Index policies will remain optimal, but the index structure will be more complex and simple closed forms as in (10) and Theorem 6 above should not be expected. Consider Blue target j with assigned prior $\Pi^j$. At time t, if Red is still alive, the sufficient statistics from the history of Red's past engagements targetting Blue j which determine Red's posterior distribution for this Blue are:

(a) the number of engagements targetting Blue j ($n$);

(b) the outcomes of Red's subsequent inspections ($\boldsymbol{\delta} = \{\delta_1, \delta_2, \ldots, \delta_n\}$).
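We take these sufficient statistics as Blue j's state at t while Red is alive and write $X_j(t) = (n, \boldsymbol{\delta})$. Red's posterior probability, given this history, that Blue j is of type b and is still alive is proportional to

$$\Pi^j_b (1 - r_b)^n (1 - \theta_b)^n \prod_{l=1}^{n} \phi_{\delta_l b} \equiv \Pi^j_b P^a_b(n, \boldsymbol{\delta}) = \Pi^j_b P^a_b\{X_j(t)\}. \qquad (24)$$

Red's posterior probability, given this history, that Blue j is of type b but is now dead is proportional to

$$\Pi^j_b \sum_{k=1}^{n} (1 - r_b)^{k-1} r_b (1 - \theta_b)^{k} (1 - \hat{\theta}_b)^{n-k} \left( \prod_{l=1}^{k-1} \phi_{\delta_l b} \right) \left( \prod_{l=k}^{n} \hat{\phi}_{\delta_l b} \right) \equiv \Pi^j_b P^d_b(n, \boldsymbol{\delta}) = \Pi^j_b P^d_b\{X_j(t)\}, \qquad (25)$$

as before. Hence, given the history summarised by $X_j(t)$, Red's posterior probabilities for Blue j are given by

$$P[\text{Blue } j \text{ is alive and of type } b \mid X_j(t)] = \frac{\Pi^j_b P^a_b\{X_j(t)\}}{\sum_{i=1}^{B} \Pi^j_i \left[ P^a_i\{X_j(t)\} + P^d_i\{X_j(t)\} \right]}, \quad 1 \le b \le B, \qquad (26)$$

and

$$P[\text{Blue } j \text{ is dead and of type } b \mid X_j(t)] = \frac{\Pi^j_b P^d_b\{X_j(t)\}}{\sum_{i=1}^{B} \Pi^j_i \left[ P^a_i\{X_j(t)\} + P^d_i\{X_j(t)\} \right]}, \quad 1 \le b \le B. \qquad (27)$$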

Our  scheduling  problem  may  be  represented  within  the  formulation  of  Section  2  (i)-(iv) 
as  follows; 


(i) State space is the set of all possible histories $(n, \boldsymbol{\delta})$. Since in general under this
model,  Red  can  never  be  certain  that  Blue  j  has  been  killed,  there  is  no  state  IDj. 

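(ii) Should action Uj be chosen at t when $X_j(t) = (n, \boldsymbol{\delta})$ there are two modes of transition to $X_j(t+1)$, depending upon whether Red is killed in the engagement or not. If Red is not killed, we have a state transition of the form

$$(n, \boldsymbol{\delta}) = X_j(t) \to X_j(t)(\delta) \equiv \{n + 1, (\boldsymbol{\delta}, \delta)\} \qquad (28)$$

with probability

$$\frac{\sum_{b=1}^{B} \Pi^j_b \left( P^a_b\{X_j(t)\} \left\{ (1 - r_b)(1 - \theta_b)\phi_{\delta b} + r_b(1 - \theta_b)\hat{\phi}_{\delta b} \right\} + P^d_b\{X_j(t)\}(1 - \hat{\theta}_b)\hat{\phi}_{\delta b} \right)}{\sum_{b=1}^{B} \Pi^j_b \left[ P^a_b\{X_j(t)\} + P^d_b\{X_j(t)\} \right]}. \qquad (29)$$

If Red is killed then $X_j(t+1) = \omega_j$. This happens with probability

$$\frac{\sum_{b=1}^{B} \Pi^j_b \left[ P^a_b\{X_j(t)\}\theta_b + P^d_b\{X_j(t)\}\hat{\theta}_b \right]}{\sum_{b=1}^{B} \Pi^j_b \left[ P^a_b\{X_j(t)\} + P^d_b\{X_j(t)\} \right]}. \qquad (30)$$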
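The posterior and transition calculations for this model can be sketched numerically. The following Python fragment is only an illustration under stated assumptions: the function names, the encoding of inspection outcomes as row indices of two classification matrices, and all numerical values are hypothetical and do not come from the paper. It assumes that the alive factor is $(1-r_b)^n(1-\theta_b)^n$ times the product of alive-classification probabilities, and that the dead factor sums over the engagement in which the Blue was killed.

```python
def likelihoods(history, r, theta, theta_hat, phi, phi_hat):
    """Alive/dead likelihood factors per Blue type after an inspection history.

    history: sequence of inspection outcomes (row indices into phi / phi_hat).
    phi[d][b]:     P(judged d | Blue alive, type b)  (hypothetical encoding)
    phi_hat[d][b]: P(judged d | Blue dead,  type b)
    """
    n = len(history)
    alive, dead = [], []
    for b in range(len(r)):
        a = ((1 - r[b]) * (1 - theta[b])) ** n
        for d in history:
            a *= phi[d][b]
        total = 0.0
        for k in range(1, n + 1):  # k = engagement in which Blue was killed
            term = ((1 - r[b]) ** (k - 1) * r[b] * (1 - theta[b]) ** k
                    * (1 - theta_hat[b]) ** (n - k))
            for l, d in enumerate(history, start=1):
                term *= phi[d][b] if l < k else phi_hat[d][b]
            total += term
        alive.append(a)
        dead.append(total)
    return alive, dead


def posterior_and_transitions(prior, history, r, theta, theta_hat, phi, phi_hat):
    """Posterior over (type, alive/dead) plus next-step outcome probabilities."""
    alive, dead = likelihoods(history, r, theta, theta_hat, phi, phi_hat)
    types = range(len(r))
    norm = sum(prior[b] * (alive[b] + dead[b]) for b in types)
    post_alive = [prior[b] * alive[b] / norm for b in types]
    post_dead = [prior[b] * dead[b] / norm for b in types]
    # Probability that Red survives the next engagement and observes outcome d.
    next_outcome = [
        sum(post_alive[b] * (1 - theta[b])
            * ((1 - r[b]) * phi[d][b] + r[b] * phi_hat[d][b])
            + post_dead[b] * (1 - theta_hat[b]) * phi_hat[d][b]
            for b in types)
        for d in range(len(phi))
    ]
    # Probability that Red is killed in the next engagement.
    red_killed = sum(post_alive[b] * theta[b] + post_dead[b] * theta_hat[b]
                     for b in types)
    return post_alive, post_dead, next_outcome, red_killed


# Illustrative values only: two Blue types, four classification outcomes
# (0: type 1 alive, 1: type 2 alive, 2: type 1 dead, 3: type 2 dead).
prior = [0.5, 0.5]
r, theta, theta_hat = [0.9, 0.5], [0.2, 0.4], [0.05, 0.05]
phi = [[0.6, 0.2], [0.2, 0.6], [0.1, 0.1], [0.1, 0.1]]
phi_hat = [[0.1, 0.1], [0.1, 0.1], [0.6, 0.2], [0.2, 0.6]]
```

Because each column of the two classification matrices sums to one, the survival-and-outcome probabilities and the probability that Red is killed should together sum to one, which gives a convenient sanity check on any implementation.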
In order to develop the indices which determine optimal shooting policies for Red, it will assist notationally if we drop the Blue target identifier j and use H for a generic sufficient history of the form $(n, \boldsymbol{\delta})$ above. We wish to obtain $G(\Pi, H)$, namely the index for a Blue with prior $\Pi$ and history H. We shall use an adapted version of the "restart-in-H" approach to index computation proposed by Katehakis and Veinott (1987). See also Glazebrook and Greatrix (1995). To this end, use $\Omega(H)$ for the set of histories reachable (in the obvious sense) from history H and $B\{\Omega(H)\}$ for the set of bounded real-valued functions on $\Omega(H)$.

The "restart-in-H" problem appropriate for index computation is a Markov Decision Problem with initial state H. Actions at each stage of the process are either that Red should engage with Blue or that the current state should be reset to H and that Red should then engage with Blue. Transition probabilities and returns are as above. The process is terminated by Red's death at any stage. From Katehakis and Veinott (1987) we infer that $G(\Pi, H)$ is the value function for this problem. To obtain $G(\Pi, H)$ we use the following iterative scheme: let $u \in B\{\Omega(H)\}$ and $H' \in \Omega(H)$.

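Consider the transform $T_H : B\{\Omega(H)\} \to B\{\Omega(H)\}$ given by

$$\{T_H(u)\}(H') = \max\Bigg\{ \Bigg[ \sum_{b=1}^{B} \Pi_b P^a_b(H') \Big\{ R_b r_b + \beta r_b (1 - \theta_b) \sum_{\delta} \hat{\phi}_{\delta b} u\{H'(\delta)\} + \beta (1 - r_b)(1 - \theta_b) \sum_{\delta} \phi_{\delta b} u\{H'(\delta)\} \Big\}$$
$$\qquad + \sum_{b=1}^{B} \Pi_b P^d_b(H') \beta (1 - \hat{\theta}_b) \sum_{\delta} \hat{\phi}_{\delta b} u\{H'(\delta)\} \Bigg] \times \Bigg( \sum_{b=1}^{B} \Pi_b \{P^a_b(H') + P^d_b(H')\} \Bigg)^{-1},$$
$$\Bigg[ \sum_{b=1}^{B} \Pi_b P^a_b(H) \Big\{ R_b r_b + \beta r_b (1 - \theta_b) \sum_{\delta} \hat{\phi}_{\delta b} u\{H(\delta)\} + \beta (1 - r_b)(1 - \theta_b) \sum_{\delta} \phi_{\delta b} u\{H(\delta)\} \Big\}$$
$$\qquad + \sum_{b=1}^{B} \Pi_b P^d_b(H) \beta (1 - \hat{\theta}_b) \sum_{\delta} \hat{\phi}_{\delta b} u\{H(\delta)\} \Bigg] \times \Bigg( \sum_{b=1}^{B} \Pi_b \{P^a_b(H) + P^d_b(H)\} \Bigg)^{-1} \Bigg\}. \qquad (31)$$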


We now use $T^n_H$ for an n-fold application of $T_H$, namely

$$T^1_H = T_H \quad \text{and} \quad T^n_H = T_H(T^{n-1}_H), \quad n \ge 2.$$

Standard results concerning value iteration for discounted Markov Decision Processes (see, for example, Ross (1970)) yield the value function for the restart-in-H problem as

$$\lim_{n \to \infty} \{T^n_H(u)\}(H) = G(\Pi, H) \quad \text{for all } u \in B\{\Omega(H)\}. \qquad (32)$$

Theorem  7  summarises  the  results  of  the  above  analysis  for  this  case. 


Theorem 7 If Red is still alive at t then he optimally targets any Blue $j^*$ for which

$$G_{j^*}\{\Pi^{j^*}, X_{j^*}(t)\} = \max_{1 \le j \le N} G_j\{\Pi^j, X_j(t)\},$$

where  the  indices  are  determined  by  the  iterative  scheme  in  (31)  and  (32). 
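A rough numerical sketch of the iterative scheme may help fix ideas. The Python fragment below is an approximation under explicit assumptions and is not the authors' implementation: it assumes the transform takes the maximum of an "engage from the current history" value and an "engage after restarting at H" value, it truncates the reachable histories to a fixed number of further inspections (`depth`, hypothetical) and treats the value beyond that depth as zero, and it runs a fixed number of sweeps rather than testing convergence. All names and numerical values are illustrative.

```python
from functools import lru_cache
from itertools import product


def restart_index(prior, history, R, r, theta, theta_hat, phi, phi_hat,
                  beta=0.95, depth=3, sweeps=60):
    """Truncated value iteration for the restart-in-H problem (a sketch)."""
    B, D = len(r), len(phi)
    H = tuple(history)

    @lru_cache(maxsize=None)
    def weights(h):
        # Recursive form of the alive/dead likelihood factors for history h.
        if not h:
            return (1.0,) * B, (0.0,) * B
        alive, dead = weights(h[:-1])
        d = h[-1]
        new_alive = tuple(alive[b] * (1 - r[b]) * (1 - theta[b]) * phi[d][b]
                          for b in range(B))
        new_dead = tuple((dead[b] * (1 - theta_hat[b])
                          + alive[b] * r[b] * (1 - theta[b])) * phi_hat[d][b]
                         for b in range(B))
        return new_alive, new_dead

    def engage(h, u):
        # Expected return from engaging at history h, continuing with values u.
        alive, dead = weights(h)
        norm = sum(prior[b] * (alive[b] + dead[b]) for b in range(B))
        total = 0.0
        for b in range(B):
            # Histories beyond the truncation depth contribute zero (assumption).
            cont_alive = sum(phi[d][b] * u.get(h + (d,), 0.0) for d in range(D))
            cont_dead = sum(phi_hat[d][b] * u.get(h + (d,), 0.0) for d in range(D))
            total += prior[b] * alive[b] * (
                R[b] * r[b]
                + beta * r[b] * (1 - theta[b]) * cont_dead
                + beta * (1 - r[b]) * (1 - theta[b]) * cont_alive)
            total += prior[b] * dead[b] * beta * (1 - theta_hat[b]) * cont_dead
        return total / norm

    states = [H + s for k in range(depth + 1)
              for s in product(range(D), repeat=k)]
    u = dict.fromkeys(states, 0.0)
    for _ in range(sweeps):
        restart = engage(H, u)  # value of resetting to H and then engaging
        u = {s: max(engage(s, u), restart) for s in states}
    return u[H]


# Illustrative values only (two Blue types, four classification outcomes).
prior = [0.5, 0.5]
R, r = [50.0, 250.0], [0.9, 0.5]
theta, theta_hat = [0.2, 0.4], [0.05, 0.05]
phi = [[0.6, 0.2], [0.2, 0.6], [0.1, 0.1], [0.1, 0.1]]
phi_hat = [[0.1, 0.1], [0.1, 0.1], [0.6, 0.2], [0.2, 0.6]]
index_estimate = restart_index(prior, (), R, r, theta, theta_hat, phi, phi_hat)
```

Starting from the zero function, the iterates are non-decreasing, so the truncated scheme should not fall below the one-stage expected return. With a discount rate of zero it reduces exactly to that one-stage return, which is a useful check. The number of stored histories grows geometrically in `depth`, so this sketch is only practical for small truncation depths.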
6  Numerical  Study

We  report  on  the  outcome  of  a  simulation  study  whose  aim  is  to  explore  statistical  properties 
of  the  optimal  (index)  shooting  policy  and  other  competitor  policies  for  Red.  This  study 
will  be  in  the  context  of  instances  of  a  minor  variant  of  Model  1  for  which  N  =  10  (ten  Blue 
targets)  and  B  =  5  (five  Blue  types).  In  this  variant  we  suppose  that  Red  cannot  be  killed 
in  any  engagement  in  which  he  kills  a  Blue  opponent.  Table  1  contains  details  of  the  Blue 
types. The reader will observe that model parameters have been chosen such that the Blues
which yield the highest rewards for Red are the most difficult to kill. Targetting these also
makes  Red  more  vulnerable.  Red  must  strike  an  optimal  balance  between  garnering  returns 
from  Blue  kills  and  remaining  alive. 


b     $r_b$    $\theta_b$    $R_b$
1     0.9      0.2           50
2     0.7      0.3           125
3     0.5      0.4           250
4     0.3      0.5           500
5     0.1      0.6           1000

Table 1: Details of the five Blue types


The study consisted of 40,000 runs - with 10,000 runs being conducted under each of four different policies for Red. For each run, the 50 probabilities $\Pi^j_b$ are drawn independently from a U(0, 1) distribution and normalised to achieve

$$\sum_{b=1}^{5} \Pi^j_b = 1, \quad 1 \le j \le 10.$$

Discount rate $\beta$ was set equal to 0.95 in all cases. The four shooting policies for Red are as
follows: 

(I)  Index  Policy  -  This  is  the  policy  which  maximises  the  expected  return  earned  by  Red 
before  his  own  death; 

(II) Myopic Policy - Here Red's policy is to shoot next at whichever Blue is still alive and offers him the highest one-stage return. If Blue j has prior $\Pi^j$ and has had n inconclusive engagements with Red to date, this one-stage return is given by

$$\left\{ \sum_{b=1}^{B} \Pi^j_b (1 - r_b)^n (1 - \theta_b)^n R_b r_b \right\} \left\{ \sum_{b=1}^{B} \Pi^j_b (1 - r_b)^n (1 - \theta_b)^n \right\}^{-1};$$

(III)  Random  Policy  -  At  each  stage,  Red  chooses  between  the  still-alive  Blues  at  random, 
with  all  Blue  targets  equally  likely; 

(IV)  Round  Robin  Policy  -  Red  cycles  around  the  Blue  targets  (which  are  still  alive)  in 
numerical  order.  The  first  target  is  chosen  at  random. 
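The prior-generation step and the myopic one-stage return can be illustrated as follows. This Python fragment is a minimal sketch: the function names are hypothetical, the type parameters are those listed in Table 1, and the one-stage return is assumed to be the posterior-weighted average of $R_b r_b$ after n inconclusive engagements, as described under policy (II).

```python
import random

# Type parameters as listed in Table 1: kill probability r_b, Red's
# vulnerability theta_b and return R_b for each of the five Blue types.
R_KILL = [0.9, 0.7, 0.5, 0.3, 0.1]
THETA = [0.2, 0.3, 0.4, 0.5, 0.6]
REWARD = [50.0, 125.0, 250.0, 500.0, 1000.0]


def draw_prior(num_types, rng):
    """Draw independent U(0, 1) weights and normalise them to sum to one."""
    raw = [rng.random() for _ in range(num_types)]
    total = sum(raw)
    return [x / total for x in raw]


def myopic_return(prior, n, r=R_KILL, theta=THETA, reward=REWARD):
    """One-stage expected return from a Blue after n inconclusive engagements."""
    # Posterior weight of each type: prior times P(n inconclusive engagements).
    weights = [p * ((1 - rb) * (1 - tb)) ** n
               for p, rb, tb in zip(prior, r, theta)]
    gain = sum(w * Rb * rb for w, Rb, rb in zip(weights, reward, r))
    return gain / sum(weights)


rng = random.Random(0)  # fixed seed, chosen only for reproducibility
priors = [draw_prior(5, rng) for _ in range(10)]  # one prior per Blue target
# The myopic policy targets the Blue with the largest one-stage return.
best_target = max(range(10), key=lambda j: myopic_return(priors[j], 0))
```

With these parameters the product $(1-r_b)(1-\theta_b)$ is largest for type 5, so after many inconclusive engagements the posterior concentrates on type 5 and the one-stage return approaches $R_5 r_5 = 100$.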
For  each  policy,  Tables  2  and  3  contain  summaries  of  the  10,000  runs  conducted.  Table
2  gives  a  statistical  summary  of  the  returns  earned  by  Red  and  records  the  mean  return,  the 
minimum  (Min),  lower  quartile  (LQ),  median  (Med),  upper  quartile  (UQ)  and  maximum 
(Max).  Table  3  gives  a  similar  summary  for  the  number  of  Blue  targets  destroyed  by  Red 
before  he  himself  is  killed.  The  final  column  of  Table  3  also  gives,  for  each  policy,  the 
percentage  of  runs  for  which  Red  is  killed  (i.e.  before  all  the  Blues  are). 


Policy         Mean      Min     LQ       Med       UQ        Max
Index          402.53    0.00    50.00    250.00    600.00    3208.32
Myopic         351.25    0.00    0.00     125.00    500.00    3393.94
Random         359.36    0.00    0.00     168.75    525.00    3353.82
Round-Robin    363.73    0.00    0.00     168.75    546.52    3476.21

Table 2: Summary of Red's returns using four different shooting policies


Policy         Mean    Min    LQ    Med    UQ    Max    % Red killed
Index          2.41    0      1     2      4     10     99.23
Myopic         1.52    0      0     1      2     10     98.90
Random         1.91    0      0     1      3     10     99.01
Round-Robin    1.87    0      0     1      3     10     99.24

Table 3: Summary of number of Blues killed and Red's death rate for four shooting policies for Red.


That  the  index  policy  should  outperform  the  others  with  regard  to  its  mean  total  return 
is guaranteed by Theorem 1. What is of note from the numerical results is its comprehensive
dominance of the alternatives studied with regard to all summary measures of returns
obtained  and  targets  killed.  The  poor  performance  of  the  myopic  policy  is  rooted  in  its 
indifference  to  the  issue  of  Red’s  vulnerability  when  targetting  different  Blues.  Its  very  low 
median  return  (half  that  of  the  index  policy)  speaks  of  many  conflicts  in  which  Red  is  killed 
very  early.  The  evidence  from  the  study  is  that  in  this  context  it  would  be  better  for  Red 
to shoot at random (or in a round robin fashion) than myopically. In addition to maximising
returns, the index policy also outperforms the others with regard to numbers of Blues
killed  -  see  Table  3.  The  probability  that  Red  does  not  survive  the  conflict  is  roughly  policy 
independent. 

7  Extensions  and  comments 

A  range  of  extensions  to  the  models  discussed  in  Sections  2-5  is  possible  for  which  index 
policies  either  remain  optimal  or  (at  least)  continue  to  perform  well.  See  Gaver,  Glazebrook 
and  Pilnick  (1991)  for  a  discussion  of  such  model  elaborations  in  a  different  problem  context 
and  Glazebrook,  Gaver  and  Jacobs  (2001)  for  a  discussion  which  focusses  specifically  on  a 
variant  of  Model  1.  Important  among  such  model  developments  are  those  which  acknowledge 
that  Red  has  a  finite  number  of  bullets  only.  We  note  the  following: 
(a) If Red has a finite number of bullets then we have "finite horizon" versions of the
(potentially infinite) scenarios analysed in preceding sections. The index policies developed
there will continue to be optimal for Red in the so-called deteriorating cases
in which each Blue's index decreases almost surely after each inconclusive engagement.
This will happen, for example, in Model 1 when $H_j(n)$ (see (11)) is decreasing in n
for each j. Note from Lemma 3 that when B = 2, the $H_j$ are guaranteed to be either
all increasing or all decreasing. The proof contains a condition under which the
decreasing  case  is  guaranteed.  Other  versions  of  the  “finite  horizon”  problem  outside 
of  the  deteriorating  case  are  not  indexable  in  general,  but  index  policies  will  usually 
continue  to  perform  very  well.  Mitchell  (2003)  has  conducted  a  numerical  study  of 
the  performance  of  index  policies  for  a  version  of  Model  2  in  which  the  Blues  do  not 
retaliate.  In  the  interests  of  brevity,  we  shall  omit  further  details  other  than  to  point 
out  that  the  broad  approach  to  the  numerical  investigation  was  as  in  the  study  outlined 
in Section 6. For scenarios in which Red was limited in the numerical investigation
to 25, 50 and 100 bullets respectively and for which Red had an infinite supply, his
expected return from killing Blues was estimated for four shooting policies. These
policies broadly correspond to those considered in Section 6. In Table 4 below, find
values of $100 (R^{Index} - R^{Policy}) / R^{Index}$, where we use $R^{Index}$ and $R^{Policy}$ for
the total expected returns for Red under the index policy and under any specified
shooting  policy  respectively.  Please  note  the  outstandingly  strong  performance  of  the 
index  policy  in  the  short  horizon  (25  bullet)  case.  By  the  time  Red  is  assumed  to  have 
50  bullets,  the  profile  of  returns  is  much  as  in  the  infinite  horizon  case.  The  reader 
should  note  that,  in  contrast  to  the  results  in  Tables  2  and  3,  different  lethalities  of 
the  Blues  are  no  longer  present  to  undermine  the  performance  of  the  myopic  policy. 


               Number of bullets available to Red
Policy         25        50        100       $\infty$
Myopic         7.84%     5.78%     5.73%     5.80%
Random         60.43%    44.56%    43.83%    43.95%
Round-robin    38.00%    29.51%    29.07%    29.02%

Table 4: Percentage return lost when Red implements a shooting policy other than the index policy.


(b) In Models 1 and 3, each Blue target is supposed to have fixed characteristics, which however may be imperfectly known by Red. In Model 1 these are summarised by the triple $(r_b, \theta_b, R_b)$ for Blues of type b. In Model 2 target characteristics are dynamic, and change by means of a process of accumulating damage caused by Red's shots. The approach described in the paper and the general model of Section 2 can easily accommodate evolution of Blue target characteristics during targetting by Red. However, we may wish to model target dynamics while not under fire. For example, Blue may wish to re-deploy alive targets not under current fire so as to be more lethal to Red. This possibility takes us into a class of decision processes which are a generalised form of the restless bandit problems of Whittle (1988). While restless bandit problems are intractable in general, Whittle proposed an index heuristic (well defined under given conditions). These index heuristics have proved outstandingly effective in other application contexts. See, for example, Glazebrook, Lumley and Ansell (2003) and Glazebrook and Mitchell (2002). The first author is conducting an extensive research programme in this challenging yet important area.

In  closing,  we  briefly  consider  issues  for  the  Blue  force.  For  definiteness,  the  discussion  will 
again  be  conducted  in  the  context  of  Model  1,  discussed  in  Section  3.  A  natural  first  question 
for  the  controller  of  Blue  concerns  what  force  he  needs  to  deploy  to  destroy  an  optimally 
shooting Red with a given large probability $1 - \varepsilon$. This turns out to be straightforward to assess. Suppose that $N_b$ Blues of type b are deployed, $1 \le b \le B$. The probability of Red's ultimate survival (having destroyed all Blues) does not depend upon his strategy for engaging them. Hence, we may suppose that Red targets each Blue in a continuous set of exchanges until one or both are destroyed. In such an engagement it is easy to show that

$$P(\text{Red survives and kills Blue of type } b) = \frac{r_b (1 - \theta_b)}{r_b + \theta_b (1 - r_b)} \equiv \Psi_b, \quad 1 \le b \le B,$$

$$P(\text{Red is killed and Blue of type } b \text{ survives}) = \frac{(1 - r_b) \theta_b}{r_b + \theta_b (1 - r_b)} \equiv \Lambda_b, \quad 1 \le b \le B,$$

and

$$P(\text{Red and Blue of type } b \text{ are both killed}) = \frac{r_b \theta_b}{r_b + \theta_b (1 - r_b)} \equiv \Theta_b, \quad 1 \le b \le B.$$

Hence the probability that Red survives the battle with $N_b$ type b Blues, $1 \le b \le B$, is given by

$$\prod_{b=1}^{B} \Psi_b^{N_b}$$

and this is required to be no greater than $\varepsilon$.
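The force-sizing calculation just described is easy to sketch. The Python fragment below is illustrative only: the function names are hypothetical, and it assumes that within each exchange Blue is killed with probability $r_b$ and Red with probability $\theta_b$ independently, so that an exchange is inconclusive with probability $(1-r_b)(1-\theta_b)$.

```python
def duel_probabilities(r, theta):
    """Outcome probabilities when Red fights one Blue until someone is destroyed.

    Returns (Red survives and kills Blue, Red killed and Blue survives,
    both killed). The three values sum to one.
    """
    denom = r + theta * (1 - r)  # 1 - P(an exchange is inconclusive)
    red_wins = r * (1 - theta) / denom
    blue_wins = (1 - r) * theta / denom
    both_die = r * theta / denom
    return red_wins, blue_wins, both_die


def red_survival(counts, r, theta):
    """Probability that Red destroys every deployed Blue and survives."""
    prob = 1.0
    for n_b, r_b, theta_b in zip(counts, r, theta):
        red_wins, _, _ = duel_probabilities(r_b, theta_b)
        prob *= red_wins ** n_b
    return prob


def min_single_type_force(r, theta, eps):
    """Smallest number of Blues of one type keeping Red's survival <= eps."""
    n = 0
    while red_survival([n], [r], [theta]) > eps:
        n += 1
    return n


# Illustrative use with r = 0.9 and theta = 0.2 (type 1 in Table 1).
needed = min_single_type_force(0.9, 0.2, eps=0.05)
```

For a Blue with $r_b = 0.9$ and $\theta_b = 0.2$, Red wins each duel with probability $0.72/0.92 \approx 0.78$, so thirteen such Blues are needed before Red's survival probability drops to 5% or below.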

If  we  now  ask  how  the  Blue  force  should  accomplish  the  destruction  of  Red  with  given 
probability  at  least  cost  to  itself,  then  Red’s  shooting  strategy  does  come  into  play  since, 
for example, Red may tend to target "expensive" Blues first. Now, suppose that Red shoots optimally with a single weapon and consider a simple scenario for Model 1 in which B = 2 and all indices are increasing in n. See Lemma 3. Hence Red's optimal policy targets each Blue continuously until one or other is destroyed. Write $C(N_1, N_2)$ for the expected cost to the Blue force of the deployment of $N_b$ type b's, b = 1, 2, against an optimally shooting Red. We shall assume here that $\beta = 1$. Blue's optimisation problem is

$$\min \; C(N_1, N_2)$$

$$\text{subject to} \quad \Psi_1^{N_1} \Psi_2^{N_2} \le \varepsilon. \qquad (33)$$

We now describe a scenario in which $C(N_1, N_2)$ may be computed easily. Suppose that Red's prior distributions for the Blues he faces are obtained by moderating initial ignorance about them (expressed by $P(\text{Blue is supposed to be of type } b) = 0.5$, b = 1, 2) by means of information obtained from a sensor. This sensor can only judge Blue type with error. We have

$$P(\text{Blue judged to be of type } b_1 \mid \text{Blue is of type } b_2) = \phi_{b_1 b_2}$$

for all choices of $b_1, b_2$. Hence Red allocates to each Blue one of two possible priors $\Pi^b$, b = 1, 2, according to the judgement of the sensor. We have

$$\Pi^b_1 = P(\text{Blue is of type 1} \mid \text{Blue judged to be of type } b) = \frac{\phi_{b1}}{\phi_{b1} + \phi_{b2}}.$$

Let $X_1$ be a binomial $\mathrm{Bin}(N_1, \phi_{11})$ random variable representing the number of the $N_1$ type 1 Blues judged by the sensor to be of type 1 and hence given prior $\Pi^1$ by Red. Similarly $X_2 \sim \mathrm{Bin}(N_2, \phi_{22})$. Red faces $X_1 + N_2 - X_2$ Blues to which he allocates prior $\Pi^1$ and initial index $G_1(0)$ and $N_1 - X_1 + X_2$ Blues to which he allocates prior $\Pi^2$ and initial index $G_2(0)$. Suppose that $G_1(0) > G_2(0)$ and so Red engages first all Blues judged to be of type 1. If Red faces two or more Blues with the same index, he chooses between them at random. Now write $c(b_1, b_2)$ for the expected cost to Blue when Red engages $b_1$ type 1 Blues and $b_2$ type 2 Blues in random order. If $C_b$ is the cost of deploying a single Blue of type b, then

$$(b_1 + b_2) c(b_1, b_2) = b_1 \{ C_1 (\Psi_1 + \Theta_1) + \Psi_1 c(b_1 - 1, b_2) \} + b_2 \{ C_2 (\Psi_2 + \Theta_2) + \Psi_2 c(b_1, b_2 - 1) \},$$

$$c(0, 0) = 0,$$

which enables recursive calculation of any $c(b_1, b_2)$. We deduce that the expected cost to Blue of the chosen deployment is given by

$$C(N_1, N_2) = E \left\{ c(X_1, N_2 - X_2) + \Psi_1^{X_1} \Psi_2^{N_2 - X_2} c(N_1 - X_1, X_2) \right\}$$

and  this  may  now  be  used  in  (33).  In  more  complicated  situations,  Blue’s  expected  cost 
may  be  computed  via  suitable  development  of  the  methodologies  described  by  Bertsimas 
and  Niño-Mora  (1996)  for  multi-armed  bandits.
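The recursive cost calculation for the two-type scenario can be sketched as follows. This Python fragment is a hedged illustration rather than the authors' code: the function names and numerical values are hypothetical, and it assumes that (i) a Blue is lost whenever Red wins the duel or both are killed, (ii) Red moves on to the next Blue only if he wins the duel, and (iii) the second group (Blues judged to be of type 2) is engaged only if Red survives the whole first group.

```python
from functools import lru_cache
from math import comb


def make_cost(r, theta, cost):
    """Build c(b1, b2): expected Blue cost when Red meets b1 type 1 and
    b2 type 2 Blues in random order (two types only)."""
    red_wins, blue_lost = [], []
    for r_b, theta_b in zip(r, theta):
        denom = r_b + theta_b * (1 - r_b)
        red_wins.append(r_b * (1 - theta_b) / denom)  # Red survives, Blue killed
        blue_lost.append(r_b / denom)                 # Blue killed, either way

    @lru_cache(maxsize=None)
    def c(b1, b2):
        if b1 == 0 and b2 == 0:
            return 0.0
        total = 0.0
        if b1:
            total += b1 * (cost[0] * blue_lost[0] + red_wins[0] * c(b1 - 1, b2))
        if b2:
            total += b2 * (cost[1] * blue_lost[1] + red_wins[1] * c(b1, b2 - 1))
        return total / (b1 + b2)

    return c, red_wins


def expected_cost(n1, n2, phi11, phi22, r, theta, cost):
    """Expected Blue cost, averaging over the sensor's binomial outcomes."""
    c, red_wins = make_cost(r, theta, cost)
    total = 0.0
    for x1 in range(n1 + 1):      # type 1 Blues correctly judged type 1
        p1 = comb(n1, x1) * phi11 ** x1 * (1 - phi11) ** (n1 - x1)
        for x2 in range(n2 + 1):  # type 2 Blues correctly judged type 2
            p2 = comb(n2, x2) * phi22 ** x2 * (1 - phi22) ** (n2 - x2)
            # Red reaches the second group only if he wins every first-group duel.
            survive_first = red_wins[0] ** x1 * red_wins[1] ** (n2 - x2)
            total += p1 * p2 * (c(x1, n2 - x2)
                                + survive_first * c(n1 - x1, x2))
    return total


# Illustrative values only.
r, theta, cost = [0.9, 0.5], [0.2, 0.4], [1.0, 3.0]
example = expected_cost(3, 2, 0.8, 0.7, r, theta, cost)
```

A search over small values of $N_1$ and $N_2$, keeping only deployments that satisfy the survival constraint in (33), would then give Blue's least-cost deployment in this simple scenario.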